Learning the lexicon from raw texts for open-vocabulary Korean word recognition

Authors

  • Sungho Ryu
  • Jin Hyung Kim
Abstract

In this paper, we propose a novel method of building a language model for open-vocabulary Korean word recognition. Because of the complex morphology of Korean, lexicons based on linguistic entities such as words and morphemes are inappropriate in open-vocabulary domains. Instead, we build the lexicon by collecting variable-length character sequences from raw texts using a dynamic Bayesian network model of the language. In simulated word recognition experiments, the proposed language model found the correct words from lattices of character candidates in 94.3% of cases, increasing the word recognition rate by 20.9%.
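The general idea of collecting variable-length character sequences from raw text and using them to pick a word from a lattice of character candidates can be illustrated with a minimal sketch. This is not the authors' dynamic Bayesian network model: it substitutes a simple relative-frequency score over character sequences, and the function names, the `max_len` parameter, and the toy corpus are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from itertools import product


def collect_sequences(texts, max_len=3):
    """Count every character sequence of length 1..max_len inside each
    space-delimited token of the raw texts (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        for word in text.split():
            for i in range(len(word)):
                for n in range(1, max_len + 1):
                    if i + n <= len(word):
                        counts[word[i:i + n]] += 1
    return counts


def segmentation_score(word, counts, total, max_len=3):
    """Best log-score of splitting `word` into collected sequences.

    Simple Viterbi-style dynamic programming with relative-frequency
    scores; a stand-in for a proper language model."""
    best = [0.0] + [-math.inf] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(max(0, end - max_len), end):
            piece = word[start:end]
            if counts[piece] > 0 and best[start] > -math.inf:
                score = best[start] + math.log(counts[piece] / total)
                best[end] = max(best[end], score)
    return best[-1]


def decode_lattice(lattice, counts, max_len=3):
    """Pick the highest-scoring word from a lattice given as a list of
    candidate-character lists, one list per character position.

    Candidates are enumerated exhaustively, which is only practical
    for short words."""
    total = sum(counts.values())
    candidates = ("".join(chars) for chars in product(*lattice))
    return max(
        candidates,
        key=lambda w: segmentation_score(w, counts, total, max_len),
    )


# Toy raw texts (made up for this example, not data from the paper).
corpus = ["학교에 갔다", "학교에서 공부했다", "학교는 크다"]
lexicon = collect_sequences(corpus)

# Each position lists the recognizer's candidate characters.
lattice = [["힉", "학"], ["고", "교"], ["애", "에"]]
print(decode_lattice(lattice, lexicon))
```

Because the lexicon is made of character sequences rather than whole words or morphemes, a word form that never appeared in the texts can still be scored as long as it can be assembled from collected sequences, which is the property an open-vocabulary setting needs.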

Related resources

New word learning for spoken document processing through discovery of comparable texts from external resources

This paper presents a new out-of-vocabulary (OOV) word learning approach that dynamically extends the pronunciation lexicon and the language model for large vocabulary continuous speech recognition (LVCSR) in spoken document retrieval (SDR) systems. Based on the assumption that the graphemes as well as the n-gram statistics of the OOV words can be effectively learned from other contemporary or ...

Learning Sub-Word Units for Open Vocabulary Speech Recognition

Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information-rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word-based systems; new words can then be represented by combinations of sub-word units. Previous work heuristically created the sub...

A hybrid language model for open-vocabulary Thai LVCSR

This paper investigates the use of a hybrid language model for open-vocabulary Thai LVCSR. Thai text is written without word boundary markers, and the definition of a word unit is often ambiguous due to the presence of compound words. Hence, to build open-vocabulary LVCSR, a very large lexicon is required to also handle word unit ambiguity. Pseudomorpheme (PM), a syllable-like sub-word unit specif...

The Effect of “Narrow Reading” on Learning Mid-Frequency Vocabulary: The Role of Genre and Author

This study investigated the effect of Narrow Reading (NR) on learning mid-frequency words. The Vocabulary Size Test (VST), designed by Nation and Beglar (2007), was administered as the first pre-test to 196 students, from among whom 91 students, whose vocabulary size ranged between 2,100 and 3,500 word families, became the target of this study and were randomly c...

Utilizing a noisy-channel approach for Korean LVCSR

Korean is an agglutinative and highly inflective language with a severe phonological phenomenon and coarticulation effects, making the development of a large-vocabulary continuous speech recognition system (LVCSR) difficult. Choosing a Korean orthographic word-phrase (eojeol) as a basic recognition unit leads to high out-of-vocabulary (OOV) rates, whereas choosing an orthographic syllable (eumj...

Publication date: 2003